CAGEF_services_slide.png

Advanced Graphics and Data Visualization in R

Lecture 06: Trees, Network Diagrams and Genomic Representations

0.1.0 An overview of Advanced Graphics and Data Visualization in R

"Advanced Graphics and Data Visualization in R" is brought to you by the Centre for the Analysis of Genome Evolution & Function's (CAGEF) bioinformatics training initiative. This CSB1021 was developed to enhance the skills of students with basic backgrounds in R by focusing on available philosophies, methods, and packages for plotting scientific data. While the datasets and examples used in this course will be centred on SARS-CoV-2 epidemiological and genomic data, the lessons learned herein will be broadly applicable.

This lesson is the sixth and final lecture in a 6-part series. The aim for the end of this series is for students to recognize how to import, format, and display data based on their intended message and audience. The format and style of these visualizations will help to identify and convey the key message(s) from their experimental data.

The structure of the class is a code-along style in Jupyter notebooks. At the start of each lecture, skeleton versions of the lecture will be provided for use on the University of Toronto Jupyter Hub so students can program along with the instructor.


0.2.0 Lecture objectives

This week will focus on commonly used visualizations related to genomic information. When working with sequencing data, you may commonly wish to compare sequences based on their relationships or relative similarity (trees), by their sequence identity in gene families or potential interactions (graphs), and or more directly their sequence motifs.

At the end of this lecture you will have covered the following topics

  1. Phylogenetic Trees
  2. Network graphs
  3. Genome sequences and markers

0.3.0 A legend for text format in Jupyter markdown

grey background - a package, function, code, command or directory. Backticks are also use for in-line code.
italics - an important term or concept or an individual file or folder
bold - heading or a term that is being defined
blue text - named or unnamed hyperlink


0.4.0 Data used in this lesson

Today's datasets will focus on SARS-CoV-2 variant surveillance data from the Nextstrain website which has been tracking published sequenced genomes for the appearance of new strains mainly in North America.

0.4.1 Dataset 1: nextstrain_ncov_north-america_canada_timetree.nwk

This is a Newick format data set describing a phylogenetic tree of SARS-CoV-2 strain information.

0.4.2 Dataset 2: nextstrain_ncov_north-america_canada_metadata.tsv

Metadata that accompanies the first dataset. It links strain information back to as much geographical and related information as possible.


0.5.0 Packages used in this lesson

repr- a package useful for altering some of the attributes of objects related to the R kernel.

tidyverse which has a number of packages including dplyr, tidyr, stringr, forcats and ggplot2

viridis helps to create color-blind palettes for our data visualizations

RColorBrewer has some hlepful palettes that we'll need to colour our data.

ggnewscale will be helpful in generating multiple colour palettes across increasingly complex plots.

treeio, tidytree, and ggtree will be used to help import, parse and plot phylogenetic trees.

tidygraph and ggraph will be used to generate network graph objects and plot them.

ggseqlogo is used for generating sequence logos related to sequence motifs.

qqman is a wrapper package for generating Manhattan plots.

lubridate and zoo help us to work with some date-based information.


1.0.0 What's in a name? A phylogenetic tree is just a rebranded dendrogram

Dendrograms, cladograms, and phylogenetic trees all share a similar structure with some modifications. All are represented in branching tree structures that are used to represent the relationships among the leaves, also known as tips. Branches along the tree may also be referred to as edges.

Leaves can represent different species, strains, or multi-dimensional values. Leaves are connected by branches to their nearest neighbours or relatives. As you move backwards along the tree, you encounter internal nodes which can connect directly to more tips or more nodes. The distance between tips and nodes represent a relative distance that may be defined as some type of evolutionary distance, time, or simple euclidean distance. In a cladogram, however, distances along branches have no meaning and only define the presence of a relationship.

In tree terminology, it is helpful to define a few more terms:

Term Description
Root A tree is rooted when it defines a common ancester for all tips in the tree. This could be considered the start or entry point for the tree
Most recent common ancester (MRCA) This is an internal node that collectively joins two or more tips.
Clade A group of taxa (tips) that includes a common ancestor and all of its descendents.
Scaling This indicates that the branch lengths do represent a distance metric, whereas, an unscaled tree may have even-length branches not representative of evolutionary/relationship distances.

Phylogenetic-Tree-542x420.png Diagram of phylogenetic tree terms from 10 differences between cladograms and phylogenetics trees


1.1.0 Where does tree data come from?

As we saw last week, clustering analysis such as that from hclust() can create dendrogram data. We plotted this in a couple of ways by itself and along with a heatmap. In the case of our first foray, we were looking at the "relationships" between samples by indicating how similar they were based on the characteristics in their various features.

Depending on the software used, trees can be represented in a number of formats, some of which are described below.

Format Description
Newick The standard used to represent trees in a computer-readable format. Trees are encoded in a parentheses-closure format where each tip takes the form of "taxa:branch-length" and pairs of tips are separated by a comma , and enclosed by parentheses (). Internal nodes branch lengths are defined outside the parentheses with "():branch-length"
NEXUS/phylip Incorporates the Newick tree with separate related data that is partitioned into different blocks. Blocks are started with "BEGIN \<BLOCK NAME>;" and closed with "END;"
NHX New Hampshire eXtended format is also based on Newick format but instead of code blocks uses a tagging system for each node extending the format to "taxa:branch-length[&&NHX:<tag information>]"

1.2.0 Using read.tree() from the treeio package

When opening tree files, depending on the format, you can use the treeio package and one of it's many parsing functions but the simplest way to avoid figuring out how to open a tree file, is to just use the read.tree() function. This will determing the appropriate file type and parse through the many different tree formats before depositing the data into a phylo object.

Today we'll be working with a dataset from the Auspice COVID-19 North American dataset maintained by Finlay Maguire

Let's begin by opening up the tree data and some related metadata before taking a look under the hood.


1.2.1 The phylo object is a simple list

From the looks of our structure, after reading in our Newick file, we have a list with six elements. They are pretty clearly named with edge information (a matrix of paired numbers), edge lengths, a count of the number of nodes (ie internal points where branches bifurcate. While mostly just numbers, some of these node.label values appear to be based on travel history information. All of the 6,627 tip.label values correspond to 6,627 SARS-CoV-2 strain names found in the metadata file as well.


1.2.2 Convert the tree object to a tibble for adding external information

Suppose we want to combine some information from our metadata with our tree? We can then use this information to colour, shape, and bring more visual order to our tree. Working with the tree, however, can be a little hard given that we have lists of different lengths representing different portions of the tree.

In our case, we only have tree tip information that we want to add to and the easiest way to work would be if it was in a table format. The package tidytree provides a function as_tibble that will parse and convert the phylo format to a tbl_tree object, which is a kind of tidy data frame.

Let's convert our variant phylo object to something we can work with.


1.3.0 Updating our tree with external information

As you can see our SC2_variants_time.tb object is a simply data frame now a 4-column table with 12,777 rows. The edge data is encoded between the parent and node columns with branch.length to define the length of each edge. We also have all of the tip and node names stored under the label column. With the tree in this format, we can now treat the tree like a tibble and join additional information to the tree.

Rules to keep in mind:

  1. Don't lose any node information from your original tree structure.
  2. Try to work with a complete dataset of information although sometimes it doesn't make sense that you would have it all.
  3. Convert your tibble back to a tree format when you're done with as.treedata

You may have informaiton about the leaves or tips of your tree but by default there is no real internal node information when it comes to sample origins or lineage. When joining information to the tables, use the correct direction *_join() to avoid data loss. A full_join() is probably the safest.

Let's take a quick look at our metadata and identify what we're interested in.


1.3.1 Pick the tip attributes that are most interesting or relevant

For our purposes it looks like we will want to explore our data with:

Let's pull that information down and add it to our tbl_tree before converting it to a treedata object.


1.3.1.1 How to access S4 object types

Take a quick look at the structure of this treedata object. What do we see? The treedata class has converted our tibble back into an S4 object with 11 slots. Each slot, is essentially a placeholder for another object. We can access these slots using the @ operator and we can further access sub elements with the $ operator.

For instance, we can see there is a phylo slot which looks suspiciously like our original SC2_variants_time.phylo object. The rest of our data frame information, which we just added to the tbl_tree before converting it is stored in the data slot. There are additional slots where we can add sequencing information for comparison in different types of phylogenetic tree visualizations.


1.4.0 Plot your tree with ggtree() from the ggtree package

Now that we've put some extra data into our tree, we are ready to plot it with the help of the ggtree() function from the ggtree package. If you haven't noticed yet, treeio, tidytree and ggtree form a suite of packages that we can use to import, alter, and visualize our trees.

Parameters for ggtree() include:

We'll also use theme_tree2() which will automatically add an x-axis scale to our tree.

We'll start simple by just visualizing all 6,627 tips of our tree.


1.5.0 Filter your tree using drop.tip()

Looking at our visualization, there are a lot of tips to display and we've coloured the tree along the branches. Another thing we can see is that there is an "NA" value. This mostly represents the branches of the tree connecting nodes internally since they do not have a specific lineage associated from our data.

As it stands, there's just way too much data here to look at and work with. We can trim that data using the drop.tip() function without worrying about how the tree is parsed. Internally, all that pruning will take place within the function. To do this, we'll generate a list of tips we want to remove first.


1.6.0 Update our pruned tree and add some extra geom_* layers

The ggtree package brings a number of ggplot2-compatible geoms to our finger tips. We'll spruce up our tree with a some tip labels to begin with.

Component type Layer type Command Description
Tree Geom geom_tree tree structure layer, with multiple layout supported
Tree Geom geom_treescale tree branch scale legend
Node Geom geom_nodepoint annotate internal nodes with symbolic points
Node Annotation geom_range bar layer to present uncertainty of evolutionary inference
Node Annotation geom_rootpoint annotate root node with symbolic point
Branch Annotation geom_label2 modified version of geom_label, with subsetting supported for labelling branches
Branch Annotation geom_segment2 modified version of geom_segment, with subsetting supported
Taxa Geom geom_point2 modified version of geom_point, with subsetting supported
Taxa Geom geom_tippoint annotate external nodes with symbolic points
Taxa Annotation geom_taxalink associate two related taxa by linking them with a curve
Taxa Annotation geom_text2 modified version of geom_text, with subsetting supported
Taxa Annotation geom_tiplab layer of tip labels
Taxa Annotation geom_tiplab2 layer of tip labels for circular layout
Clade Annotation geom_balance highlights the two direct descendant clades of an internal node
Clade Annotation geom_cladelabel annotate a clade with bar and text label
Clade Annotation geom_hilight highlight a clade with rectangle
Clade Annotation geom_strip annotate associated taxa with bar and (optional) text label

1.6.1 Replace labels with points using different geoms

In a tree this packed, you can't really read the tip labels. The data that's most helpful, however, is the colouring. You can replace your labels with points in 3 ways:

If we use geom_point() we can colour/annotate all of the nodes and tips but we already know that there isn't any phylogenetic information associated with our internal nodes. Likewise, geom_nodepoint() will allow us to colour the nodes but there isn't any grouping information associated with that.

Therefore we'll use geom_tippoint() to colour our tips based on their Nextstrain clade information and we can alter the tip shape to match the province where the strain was sequenced.


1.6.2 Fix your x-axis with the mrsd parameter

As you can see we are working with a time-based axis but it all appears to be on a relative scale and what we'd really like to see is real-world time. In order to do that, we can assign the mrsd parameter to the most recent sampling date in our dataset. We'll also set our parameter as.Date in order to see the date in a calendar format rather than a decimal-date format.


1.7.0 Thin out the tree some more with viewClade() and other information functions

Looking at the tree there are still quite a few nodes but we can zoom in on a specific clade using the viewClade() function. To accomplish this we need to access the MRCA of a group. At the same time, we should talk about ways to traverse or navigate this tree. Here's where tidytree offers up a few additional functions for searching your tbl_tree. These take on the form of function(tbl_tree, node_number) or function(tbl_tree, node_label)

function Description
child() Find the children of an internal node.
parent() Find the parent node of a tip or other node.
offspring() Find all of the offspring of a node.
ancestor() Find all of the ancestors of a tip or node
sibling() Find the nearest sibling of a tip (or node)
MRCA() Find the most recent common ancestor between two or more nodes

Let's filter our taxa list a little more and then find the MRCA between those tips.


1.7.1 Pass a ggplot2 object to viewClade()

Now that we have an MRCA node (727) that we can use to determine the specific clade we can see the tree will still be quite large. To view this particular clade, we build our plot as before, and then zoom into it afterwards with the viewClade() function.

Unfortunately, we'll have to ditch our date scale and revert to a decimal-date. If you're not using a calendar-based date, it's not a problem at all. If we look carefully at the data, we'll see that the internal nodes have an NA-value. If you were industrious enough, you could use the branch lengths to calculate traverse the tree and calculate dates for all the internal nodes as well. This would likely solve the issue.

Let's drop the tip names back in there too.


1.8.0 Make a new subplot with information from across all clades

So we've looked at a few ways to subset and plot our phylogenetic tree. We'll do a quick aside to create a more curated list of tips with representation from across multiple clades. We'll use this as the base tree for our next few graphs so that we can really see the power of using additional information from our treedata object.


1.9.0 Convert your tree style with the layout parameter

So we've only been working with a rectangular tree (horizontal layout) thus far but as we mentioned at the beginning there are actually a number of different layouts we can use. We're going to make a sort of sunburst plot now using a circular layout that we'll add visual categorical information to in a ring pattern.

First, we'll start with the circular version. We'll colour our tips by the province where the case or data was generated and label them by the Nextstrain clade information.


1.10.0 Use gheatmap() to add information to your plot

Now that we have our basic circular tree, we can use the information from curated_tree_info.df to visualize additional data on our tree. In order to use the gheatmap layer properly, we must pass it our original tree plot, along with a new dataframe (or vector) that holds the information we want to add. It should use the rownames from the data frame to help match up to the tips in our tree.

We'll drop the tip labels from our tree in favour of the heatmap we'll add.


1.10.1 ggnewscale package allows multiple color/fill scales

From our sunburst plot above we could have actually used multiple columns from curated_tree_info.df but the caveat is that the legend for each resulting column will combine all of the data into a single new legend. That's helpful if the type of data could be continuous values like a [0,1] scale heatmap colour across different features. When dealing with multiple categories, however, that doesn't work well for us.

Instead, we'd like to separate the colour/fill scales by repeatedly adding to our plot, layer by layer. As you might recall, however, we also run into the problem of scale_colour_* and scale_fill_* layers overwriting any previous layers.

To circumvent this reality, we'll use the ggnewscale package to generate new colour and fill scales with new_scale_fill() and new_scale_colour(). It can make the process slightly more encumbering but it will work out.

We'll add back in our original tip labels as well for this graph.


1.11.0 Annotate connections between tips with geom_taxalink()

Suppose we want to join to tips/taxa together to suggest that there is a relationship between them or to emphasize some association between nodes. To connect nodes or tips of the tree, we can use the geom_taxalink() layer.


1.12.0 There is definitely more to explore from the ggtree package

So we've spent quite a bit of time looking at phylogenetic trees and how to add extrenal data to tips and nodes. We've barely scratched the surface and there are a lot of additional functions within the ggtree package that you can work with to build amazing plots including adding fasta sequence data through msaplot(). You can experiment with labeling internal nodes, and specific clades as well. You can also facet_plot() rectangular trees with other kinds of plots!

We didn't even have time to combine all sorts of plot data with our circular trees using the ggtreeExtra package. Definitely check out Chapter 10 of the tree book to be inspired by the potential things you could do with more complex datasets like this:

ggtree_ggtreeExtra_microbiome.png


2.0.0 What are network diagrams and how can we use them?

We've seen a lot of trees today but a closely related structure is the network diagram. They share similar concepts in that both have nodes and branches. However, nodes are called vertices and can represent almost anything, branches between nodes are edges and can span across multiple nodes, instead of in a bifurcating tree relationship. Relationships between vertices can be bidirectional, and depending on what you'd like to present, multiple parallel edges may exist.

Network graphs are great for showing the interconnections between entities within your data. This kind of visualization can also help us realize where subtle connections occur (or don't occur!). Depending on how you've weighted or chosen edges, you can convey how strong relationships between entities are as well.

To work with graph data and plot it, we'll be using a couple of companion packages: tidygraph and ggraph. More about that later!

Since we already have some metadata about COVID genomes, let's see if we can't convert some of it to a network diagram to better understand strain information? We'll begin by selecting some information from our Nextstrain metadata set.


2.1.0 Use the tidygraph package to help prepare data

Now that we have some information that we want to work with, we want to convert that kind of data to something that can be interpreted into a graph. The tidygraph package provides a way to hook graph data into the tidyverse so that we can use common verbs and ideas to filter and work with it. Using this package we can convert our dataframe information into a tbl_graph object which is actually an igraph object.

The function we'll use to convert our data is as_tbl_graph() but it has some expectations about the data. The parameters of this function are:

If you have both a nodes and edges data frame you can certainly generate a table this way. However, we have a complex dataframe and we want to track all of the information in it. To that end we will add columns from and to based on variables that already exist and the graph nodes will be generated from this information. When we provide the data frame, the function will recognize the columns present and produce node-based data and generate edge characteristics from our other columns.


2.2.0 Plot your graph information with ggraph()

The ggraph() function is one of many from a package of the same name. This package adds geoms and architecture that is compatible with ggplot2 (for the most part). Some of the functions we are interested in are:

Component Function Description
Graph ggraph The equivalent of the call to ggplot, it sets up the basic information about the graph plot
Edge geom_edge_link Produces edges between points
Edge geom_edge_fan Produces edges between points but accounts for parallel edges, creating arcs that fan out
Edge geom_edge_parallel Produces multiple edges between points with parallel lines representing parallel edges
Edge geom_edge_loop Used to represent edges that start and end at the same node. Does not account for parallel edges
Node geom_node_point Add basic nodes to your graph
Node geom_node_circle Add nodes to your graph that can be scaled by the coordinate system

The ggraph() function takes in 3 parameters:


2.2.1 Use geom_edge_fan() to visualize parallel edges

So from our graph we can see some characteristics about how our case data is connected. Although most of the data is centred around genomes sequenced in North America so we can see that "North America" is a source region likely due the fact that we should use something like case_source instead for our from variable. We can change that if we want.

The way we see the edges, however also has all of them overlaying on top of each other. We know that this isn't going to be a 1:1 relationship and although we have coloured the edges, we only see the topmost edge in a stack of many. For a graph like this, if we want to see all of the edges, we can use the geom_edge_fan() command which will help us see all of our parallel edges in all of their splendour. Let's try it out.


2.2.2 Filter and choose your data carefully to better visualize it

The metadata we've chosen isn't very consistent and that can be the case for many datasets so you'll want to be sure of how you pick your nodes. Right now we are picking our from value based on the supposed source of the case. This can vary in consistency from a region (Asia) to a province (Ontario). Source regions also very wildly in our data set so we are kind of mixing many kinds of places in our graph. Definitely aim to be consisten when you're working with your data otherwise the nodes may have less meaning.

Let's try looking at case_source to case_location as it relates to data from Canada and Asia


2.3.0 There are many more visualization from ggraph

We've really only covered linear graph layouts in our data but there are actually many kinds of graphs that this package can produce. The layout parameter is the key to exploring all of the graph types available. Of course your data layout all has to make sense! You'll find great examples from data-imaginist.com like this circlepack graph:

circlepack.png


3.0.0 Sequence motifs and genomic markers

When working with data sometimes you may be working with sequence-specific data for analysis of motifs or you may have a large set of data describing genomic markers like single nucleotide variants. Two popular visualizations in this domain are the sequence logo and Manhattan plot.

This one is short and simple. If you have a series of sequences you'd like to plot based on frequency of each bases you see at each position, the sequence logo is for you. The ggseqlogo package offers a simple ggplot2-compatible set of geoms you can add to your graph in order to represent this data.

Your data can come in two flavours:

  1. A named list where each index position is a vector of sequences, and the names of each index can have meaning (like a transcription factor).
  2. A frequency matrix where columns correspond to positions, and rownames correspond to bases (or amino acids). Each entry is the number of times that a base is seen at a specific position. These matrices could be daisy-chained in a list.

Let's work with the former since it does have some flexibility and looks like something we recognize.


3.1.1 Plot your logo information with the ggseqlogo() wrapper function

We have two options to look at our sequence data. The ggseqlogo() is a wrapper that sets up the entire plot for us including if you'd like to facet the data. If you just want a single logo then you pass parts of the list individually.

If you'd like multiple logos, you can decided their layout with the ncol or nrow parameters. The other parameters are


3.1.2 Plot your logo information with geom_logo()

To plot our data we can use ggplot language that we are familiar with and add a layer to our plot with the geom_logo() function. In order for this to function properly, however, we need to define our parameters as:

These and other parameters can also be set in the ggseqlogo() function too.

3.2.0 The Manhattan plot for all your genome-position visualization needs

Named for it's visual similarity to the Manhattan skyline of singular towering skyscrapers scattered above low-level buildings, this plot is commonly used for visualizing genomic markers across an ordered linear axis like a series of chromosomes. The y-axis of the values can represent LOD scores or proportions of marker representation. This is frequently used to visualize GWAS or mapping data.

This is another scatterplot that we can generate directly in ggplot with some effort/setup or we can use a package like qqman.

First let's look at the kind of data. Fun fact! You can read in archived/zipped data as well, although be careful, GWAS files can be quite large!


3.2.1 Use the manhattan() function to generate a plot

To generate our Manhattan plot we can use the manhattan() wrapper function which will require four sets of parameter information:


3.2.2 Use the highlight parameter to identify markers of interest

Now that we've plotted our Manhattan plot, we can see some horizontal lines corresponding to minimum p-values of 1.0e-5 and 5.0e-8 although the significance of these can depend on sample size and allele frequencies of your SNPs.

Either way, you can highlight SNP results based on a provided list. In this case, we can generate a list ourselves by filter p-values.

4.0.0 Class Summary

With this final lecture we've covered the spectrum of visualizations covering the most basic of scatter- and barplots, graduating to boxplots, violin plots, and their variants. We've covered high-throughput data visualizations including volcano plots, heatmaps, and principal component analysis. Furthermore we looked at how simple the projection of high-dimension data can be with t-SNE and UMAP.

We finished our course today covering phylogenetic trees, network graphs, and some sequence analysis visualizations. Overall you now have the tools to wrangle data that may appear in all sorts of formats along with a better understanding of when and how to prepare some of the most common data visualizations in your burgeoning science careers.

Congratulations and good luck on your data science journey!


5.0.0 Weekly assignment

This week's assignment will be found under the current lecture folder under the "assignment" subfolder. It will include a Jupyter notebook that you will use to produce the code and answers for this week's assignment. Please provide answers in markdown or code cells that immediately follow each question section.

Assignment breakdown
Code 50% - Does it follow best practices?
- Does it make good use of available packages?
- Was data prepared properly
Answers and Output 50% - Is output based on the correct dataset?
- Are groupings appropriate
- Are correct titles/axes/legends correct?
- Is interpretation of the graphs correct?

Since coding styles and solutions can differ, students are encouraged to use best practices. Assignments may be rewarded for well-coded or elegant solutions.

You can save and download the Jupyter notebook in its native format. Submit this file to the the appropriate assignment section by 12pm on April 29th, 2021.


6.0.0 References

The book of ggtree from the Yu Lab: https://yulab-smu.top/treedata-book/index.html

Chapter 10 of ggtree with amazing and complicated plots: https://yulab-smu.top/treedata-book/chapter10.html

More information on the ggraph package: https://cran.r-project.org/web/packages/ggraph/ggraph.pdf

ggseqlogo package information: https://cran.r-project.org/web/packages/ggseqlogo/ggseqlogo.pdf

Manhattan plot tutorial: https://www.r-graph-gallery.com/101_Manhattan_plot.html

GWAS p-value threshold: https://www.nature.com/articles/s41380-020-0670-3?proof=t

More on GWAS thresholds for significance: https://www.nature.com/articles/ejhg2015269.pdf